• Introduction
  • Exploratory Data Analysis
    • Initial Data Analysis
    • Feature Engineering/Cleaning
    • Visualization
  • Data Preprocessing
    • Split Dataset
    • Transform Categorical and Numerical features
  • Models
    • Metrics Definitions
  • Model Training and Validation
    • Logistic Regression
      • Parameters
      • Validation Metrics
      • Feature Importance
    • Support Vector Machine
      • Training
      • Validation Metrics
    • Decision Tree
      • Training
      • Validation Metrics
      • Feature Importance
    • Random Forest
      • Training
      • Validation Metrics
      • Feature Importance
    • AdaBoost
      • Training
      • Validation Metrics
      • Feature Importance
    • Gradient Boosting
      • Training
      • Validation Metrics
      • Feature Importance
    • Extreme Gradient Boosting
      • Training
      • Validation Metrics
      • Feature Importance
    • Light GBM
      • Training
      • Validation Metrics
      • Feature Importance
    • Validation Score Comparison
  • Test Set
    • Support Vector Machine
    • Random Forest
    • AdaBoost
    • Gradient Boosting
    • Extreme Gradient Boosting
    • Light GBM
    • Metric Comparison
  • Selected Best Models
    • Pre-processing
    • Split data
    • Split data Train/Test
    • Transform Categorical and Numerical features
    • Cross Validation
      • Support Vector Machine
      • Random Forest
      • AdaBoost
      • Gradient Boost
      • Light GBM
      • Cross validation Score Comparison
    • Test Data
      • Support Vector Machine
      • Random Forest
      • AdaBoost
      • Gradient Boosting
      • Light Gradient Boost
    • Test Scores
    • Feature Importance
      • Random Forest
      • AdaBoost
      • Gradient Boosting
      • Light Gradient Boosting
    • Final Model Metrics
  • Insights



Introduction

Insurance fraud is a concern in many sectors, such as health care, homeowners, and automobile insurance. It is costly not only to insurers but also to non-fraudulent policyholders.

This analysis will focus on fraud within the auto insurance industry in India. The data used for this project was downloaded from Kaggle. https://www.kaggle.com/

Our goal is to use classification models for predicting which auto insurance claims are fraudulent. Several classification models will be assessed on their ability to successfully predict actual fraud.

Readers may not be interested in every section of this analysis. Sections of interest can be accessed directly through the table of contents on the left. For example, clicking on Models goes directly to the classification models.

This project can be viewed with the accompanying code by following the below link.

Insurance Fraud with Code

The following programs were used for this project.

Python 3.10.10

R 4.2.2 (Specific Visualizations)

RStudio 2023.03.1+446 (For document output)





Exploratory Data Analysis



Initial Data Analysis



The data was downloaded as five individual data sets. We will review each data set for its suitability to be merged into one data set.





## ************Train_Claim_p Information************
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 28836 entries, 0 to 28835
## Data columns (total 19 columns):
##  #   Column                 Non-Null Count  Dtype 
## ---  ------                 --------------  ----- 
##  0   CustomerID             28836 non-null  object
##  1   DateOfIncident         28836 non-null  object
##  2   TypeOfIncident         28836 non-null  object
##  3   TypeOfCollission       28836 non-null  object
##  4   SeverityOfIncident     28836 non-null  object
##  5   AuthoritiesContacted   28836 non-null  object
##  6   IncidentState          28836 non-null  object
##  7   IncidentCity           28836 non-null  object
##  8   IncidentAddress        28836 non-null  object
##  9   IncidentTime           28836 non-null  int32 
##  10  NumberOfVehicles       28836 non-null  int32 
##  11  PropertyDamage         28836 non-null  object
##  12  BodilyInjuries         28836 non-null  int32 
##  13  Witnesses              28836 non-null  object
##  14  PoliceReport           28836 non-null  object
##  15  AmountOfInjuryClaim    28836 non-null  int32 
##  16  AmountOfPropertyClaim  28836 non-null  int32 
##  17  AmountOfVehicleDamage  28836 non-null  int32 
##  18  AmountOfTotalClaim     28836 non-null  int32 
## dtypes: int32(7), object(12)
## memory usage: 3.4+ MB



## ************Train_Policy_p Information************
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 28836 entries, 0 to 28835
## Data columns (total 10 columns):
##  #   Column                      Non-Null Count  Dtype  
## ---  ------                      --------------  -----  
##  0   InsurancePolicyNumber       28836 non-null  int32  
##  1   CustomerLoyaltyPeriod       28836 non-null  int32  
##  2   DateOfPolicyCoverage        28836 non-null  object 
##  3   InsurancePolicyState        28836 non-null  object 
##  4   Policy_CombinedSingleLimit  28836 non-null  object 
##  5   Policy_Deductible           28836 non-null  int32  
##  6   PolicyAnnualPremium         28836 non-null  float64
##  7   UmbrellaLimit               28836 non-null  int32  
##  8   InsuredRelationship         28836 non-null  object 
##  9   CustomerID                  28836 non-null  object 
## dtypes: float64(1), int32(4), object(5)
## memory usage: 1.8+ MB



## ************Train_Demographics_p Information************
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 28836 entries, 0 to 28835
## Data columns (total 10 columns):
##  #   Column                 Non-Null Count  Dtype 
## ---  ------                 --------------  ----- 
##  0   CustomerID             28836 non-null  object
##  1   InsuredAge             28836 non-null  int32 
##  2   InsuredZipCode         28836 non-null  int32 
##  3   InsuredGender          28836 non-null  object
##  4   InsuredEducationLevel  28836 non-null  object
##  5   InsuredOccupation      28836 non-null  object
##  6   InsuredHobbies         28836 non-null  object
##  7   CapitalGains           28836 non-null  int32 
##  8   CapitalLoss            28836 non-null  int32 
##  9   Country                28836 non-null  object
## dtypes: int32(4), object(6)
## memory usage: 1.8+ MB



## **********Traindata_with_Targeet_p Information**********
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 28836 entries, 0 to 28835
## Data columns (total 2 columns):
##  #   Column         Non-Null Count  Dtype 
## ---  ------         --------------  ----- 
##  0   CustomerID     28836 non-null  object
##  1   ReportedFraud  28836 non-null  object
## dtypes: object(2)
## memory usage: 450.7+ KB



## ************Train_Vehicle_p Information************
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 115344 entries, 0 to 115343
## Data columns (total 3 columns):
##  #   Column                   Non-Null Count   Dtype 
## ---  ------                   --------------   ----- 
##  0   CustomerID               115344 non-null  object
##  1   VehicleAttribute         115344 non-null  object
##  2   VehicleAttributeDetails  115344 non-null  object
## dtypes: object(3)
## memory usage: 2.6+ MB



## *************Train_Vehicle_p First 25 Rows*************
##    CustomerID VehicleAttribute VehicleAttributeDetails
## 0   Cust20179        VehicleID             Vehicle8898
## 1   Cust21384     VehicleModel                  Malibu
## 2   Cust33335      VehicleMake                  Toyota
## 3   Cust27118     VehicleModel                    Neon
## 4   Cust13038        VehicleID            Vehicle30212
## 5    Cust1801        VehicleID            Vehicle24096
## 6   Cust30237     VehicleModel                     RAM
## 7   Cust21334       VehicleYOM                    1996
## 8   Cust26634       VehicleYOM                    1999
## 9   Cust20624      VehicleMake               Chevrolet
## 10  Cust14947        VehicleID            Vehicle15216
## 11  Cust21432       VehicleYOM                    2002
## 12  Cust22845       VehicleYOM                    2000
## 13   Cust9006      VehicleMake                  Accura
## 14  Cust30659       VehicleYOM                    2003
## 15  Cust18447      VehicleMake                   Honda
## 16  Cust19144        VehicleID            Vehicle29018
## 17  Cust26846        VehicleID            Vehicle21867
## 18   Cust4801       VehicleYOM                    1998
## 19  Cust18081       VehicleYOM                    2013
## 20  Cust17021      VehicleMake                     BMW
## 21  Cust30660       VehicleYOM                    2002
## 22  Cust22099        VehicleID            Vehicle30877
## 23  Cust33560       VehicleYOM                    2011
## 24  Cust17371       VehicleYOM                    2001





The data sets train claim, train policy, train demographics, and train with target are ready to be merged into one data set.

Viewing the first twenty-five rows of the Train Vehicle data, we can see that the VehicleAttribute column has repeating rows, as each CustomerID is associated with a Vehicle Model, Vehicle Make, Vehicle ID, and Vehicle YOM. The number of rows is 115344, which is four times the rows of the other data sets. This data set must be modified before it can be merged with the others. Each level of VehicleAttribute should be an individual feature holding its corresponding value from the VehicleAttributeDetails feature. This will be accomplished by making the Train Vehicle data set wider: we will spread out the VehicleAttribute feature so that each level becomes a feature, creating a new data set that is shorter and wider.
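The long-to-wide reshaping described above can be sketched with pandas. This is a minimal illustration on toy rows, assuming `DataFrame.pivot` is a suitable tool; the frame names `vehicle_long` and `vehicle_wide` and the sample values are hypothetical, and the project's own code may differ.

```python
import pandas as pd

# Toy long-format rows shaped like Train_Vehicle_p (values are illustrative only).
vehicle_long = pd.DataFrame({
    "CustomerID": ["Cust1"] * 4 + ["Cust2"] * 4,
    "VehicleAttribute": ["VehicleID", "VehicleMake", "VehicleModel", "VehicleYOM"] * 2,
    "VehicleAttributeDetails": ["Vehicle100", "Audi", "A5", "2008",
                                "Vehicle200", "Toyota", "Corolla", "2015"],
})

# Spread each VehicleAttribute level into its own column, one row per customer.
vehicle_wide = (
    vehicle_long
    .pivot(index="CustomerID", columns="VehicleAttribute",
           values="VehicleAttributeDetails")
    .reset_index()
)
vehicle_wide.columns.name = None  # drop the leftover "VehicleAttribute" axis label

print(vehicle_wide)
```

Eight long rows collapse to two wide rows with five columns, mirroring how four rows per customer become one.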







## ************train_vehicle_wide Information************
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 28836 entries, 0 to 28835
## Data columns (total 5 columns):
##  #   Column        Non-Null Count  Dtype 
## ---  ------        --------------  ----- 
##  0   CustomerID    28836 non-null  object
##  1   VehicleID     28836 non-null  object
##  2   VehicleMake   28836 non-null  object
##  3   VehicleModel  28836 non-null  object
##  4   VehicleYOM    28836 non-null  object
## dtypes: object(5)
## memory usage: 1.1+ MB



## *************train_vehicle_wide first 50 rows*************
## VehicleAttribute CustomerID     VehicleID VehicleMake VehicleModel VehicleYOM
## 0                 Cust10000  Vehicle26917        Audi           A5       2008
## 1                 Cust10001  Vehicle15893        Audi           A5       2006
## 2                 Cust10002   Vehicle5152  Volkswagen        Jetta       1999
## 3                 Cust10003  Vehicle37363  Volkswagen        Jetta       2003
## 4                 Cust10004  Vehicle28633      Toyota          CRV       2010
## 5                 Cust10005  Vehicle26409      Toyota          CRV       2011
## 6                 Cust10006  Vehicle12114    Mercedes         C300       2000
## 7                 Cust10007  Vehicle26987      Suburu         C300       2010
## 8                 Cust10009  Vehicle12490  Volkswagen       Passat       1995
## 9                  Cust1001  Vehicle28516        Saab          92x       2004
## 10                Cust10011   Vehicle8940      Nissan       Ultima       2002
## 11                Cust10012   Vehicle9379        Ford       Fusion       2004
## 12                Cust10013  Vehicle22024      Accura       Fusion       2001
## 13                Cust10014   Vehicle3601      Suburu      Impreza       2011
## 14                Cust10016   Vehicle7515        Saab          92x       2005
## 15                Cust10017  Vehicle31838        Saab          92x       2005
## 16                Cust10018  Vehicle35954      Toyota           93       2000
## 17                Cust10019  Vehicle19647        Saab           93       2000
## 18                Cust10021  Vehicle37694  Volkswagen       Passat       2006
## 19                Cust10022  Vehicle31889      Toyota   Highlander       1997
## 20                Cust10023  Vehicle10464      Toyota   Highlander       1999
## 21                Cust10024  Vehicle24452       Dodge           X5       2001
## 22                Cust10025  Vehicle12734       Dodge           X5       2002
## 23                Cust10026  Vehicle14492  Volkswagen       Passat       2001
## 24                Cust10027  Vehicle38970        Saab       Passat       1995
## 25                Cust10028   Vehicle3996       Honda       Accord       2015
## 26                Cust10029  Vehicle12477      Toyota      Corolla       2015
## 27                Cust10030  Vehicle34293        Ford    Forrestor       2006
## 28                Cust10031  Vehicle33775      Suburu         F150       2005
## 29                Cust10032  Vehicle34708      Nissan   Pathfinder       2012
## 30                Cust10034  Vehicle26030        Saab          92x       2006
## 31                Cust10035   Vehicle3961        Saab        Jetta       2007
## 32                Cust10037  Vehicle38667       Dodge         Neon       2012
## 33                 Cust1004  Vehicle17051   Chevrolet        Tahoe       2014
## 34                Cust10040   Vehicle7284        Audi     Wrangler       2007
## 35                Cust10041   Vehicle2119        Jeep           A3       2008
## 36                Cust10042   Vehicle7459      Accura           A5       1997
## 37                Cust10043   Vehicle6244      Accura          RSX       2010
## 38                Cust10044  Vehicle38446   Chevrolet       Malibu       1998
## 39                Cust10046   Vehicle3199        Audi           A5       2011
## 40                Cust10047  Vehicle13780        Audi           A5       2009
## 41                Cust10049  Vehicle35318        Ford         F150       2008
## 42                 Cust1005  Vehicle26158      Accura          RSX       2009
## 43                Cust10051  Vehicle33864       Dodge         E400       2014
## 44                Cust10052  Vehicle16314       Honda       Legacy       2002
## 45                Cust10053  Vehicle35570      Suburu       Legacy       2000
## 46                Cust10054  Vehicle13054        Audi       Ultima       2006
## 47                Cust10057  Vehicle23410      Suburu       Legacy       2005
## 48                Cust10058  Vehicle24044         BMW          92x       2005
## 49                Cust10059  Vehicle25575         BMW           X5       2006



We have taken the data from train vehicle and created a new data set called train vehicle wide. This new data set has four new columns and 28836 rows which now matches the other four data sets. We are now ready to merge all data sets.
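The merge step can be sketched as follows, assuming every frame shares the CustomerID key with one row per customer (as the matching row counts above suggest). The toy frames and their values are illustrative, and chaining `merge` through `functools.reduce` is one possible approach, not necessarily the one used in this project.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for three of the five data sets (values are illustrative only).
claim = pd.DataFrame({"CustomerID": ["Cust1", "Cust2"],
                      "AmountOfTotalClaim": [5000, 62000]})
policy = pd.DataFrame({"CustomerID": ["Cust2", "Cust1"],
                       "Policy_Deductible": [1000, 500]})
target = pd.DataFrame({"CustomerID": ["Cust1", "Cust2"],
                       "ReportedFraud": ["N", "Y"]})

# Merge the frames one after another on the shared key.
# validate="one_to_one" raises an error if any CustomerID is duplicated.
fraud = reduce(
    lambda left, right: left.merge(right, on="CustomerID", how="inner",
                                   validate="one_to_one"),
    [claim, policy, target],
)

print(fraud)
```

The `validate` argument is a cheap safeguard: a repeated key would otherwise multiply rows silently.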





## *******************fraud Information*******************
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 28836 entries, 0 to 28835
## Data columns (total 42 columns):
##  #   Column                      Non-Null Count  Dtype  
## ---  ------                      --------------  -----  
##  0   CustomerID                  28836 non-null  object 
##  1   DateOfIncident              28836 non-null  object 
##  2   TypeOfIncident              28836 non-null  object 
##  3   TypeOfCollission            28836 non-null  object 
##  4   SeverityOfIncident          28836 non-null  object 
##  5   AuthoritiesContacted        28836 non-null  object 
##  6   IncidentState               28836 non-null  object 
##  7   IncidentCity                28836 non-null  object 
##  8   IncidentAddress             28836 non-null  object 
##  9   IncidentTime                28836 non-null  int32  
##  10  NumberOfVehicles            28836 non-null  int32  
##  11  PropertyDamage              28836 non-null  object 
##  12  BodilyInjuries              28836 non-null  int32  
##  13  Witnesses                   28836 non-null  object 
##  14  PoliceReport                28836 non-null  object 
##  15  AmountOfInjuryClaim         28836 non-null  int32  
##  16  AmountOfPropertyClaim       28836 non-null  int32  
##  17  AmountOfVehicleDamage       28836 non-null  int32  
##  18  AmountOfTotalClaim          28836 non-null  int32  
##  19  InsuredAge                  28836 non-null  int32  
##  20  InsuredZipCode              28836 non-null  int32  
##  21  InsuredGender               28836 non-null  object 
##  22  InsuredEducationLevel       28836 non-null  object 
##  23  InsuredOccupation           28836 non-null  object 
##  24  InsuredHobbies              28836 non-null  object 
##  25  CapitalGains                28836 non-null  int32  
##  26  CapitalLoss                 28836 non-null  int32  
##  27  Country                     28836 non-null  object 
##  28  InsurancePolicyNumber       28836 non-null  int32  
##  29  CustomerLoyaltyPeriod       28836 non-null  int32  
##  30  DateOfPolicyCoverage        28836 non-null  object 
##  31  InsurancePolicyState        28836 non-null  object 
##  32  Policy_CombinedSingleLimit  28836 non-null  object 
##  33  Policy_Deductible           28836 non-null  int32  
##  34  PolicyAnnualPremium         28836 non-null  float64
##  35  UmbrellaLimit               28836 non-null  int32  
##  36  InsuredRelationship         28836 non-null  object 
##  37  VehicleID                   28836 non-null  object 
##  38  VehicleMake                 28836 non-null  object 
##  39  VehicleModel                28836 non-null  object 
##  40  VehicleYOM                  28836 non-null  object 
##  41  ReportedFraud               28836 non-null  object 
## dtypes: float64(1), int32(15), object(26)
## memory usage: 7.8+ MB



Feature Engineering/Cleaning





Feature engineering includes several steps.

First is feature creation. We create new variables from existing features which will help our model and data visualization.

Second, we can transform features from one representation to another. An example would be converting a numerical feature to a categorical one.

Cleaning is the process of reviewing the features and, if something does not add up, either removing the problem values or removing the feature entirely. An example is null values: we can replace a null value with another value, remove the affected observations from the data set, or, as mentioned, remove the feature entirely.





Certain features are numeric yet may better serve our models as categorical. This can be assessed by checking the unique values of these features.



## ******** Unique Number of Vehicles********
## [3 1 4 2]



## ******** Unique Bodily Injuries********
## [1 2 0]



The above outputs indicate that both NumberOfVehicles and BodilyInjuries would be best as categorical features. We will create a function that converts numerical data types to categorical and apply it to the selected numerical features.

## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 28836 entries, 0 to 28835
## Data columns (total 2 columns):
##  #   Column            Non-Null Count  Dtype   
## ---  ------            --------------  -----   
##  0   NumberOfVehicles  28836 non-null  category
##  1   BodilyInjuries    28836 non-null  category
## dtypes: category(2)
## memory usage: 281.9 KB



Both features are now of type category.
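A conversion function of this kind might look like the sketch below. The helper name `to_category` and the toy values are assumptions made for illustration; only the column names come from the data set.

```python
import pandas as pd

def to_category(df, columns):
    """Return a copy of df with the listed columns cast to the category dtype."""
    return df.astype({col: "category" for col in columns})

# Toy frame (values are illustrative only).
fraud = pd.DataFrame({
    "NumberOfVehicles": [3, 1, 4, 2, 1],
    "BodilyInjuries": [1, 2, 0, 1, 0],
    "InsuredAge": [35, 42, 51, 29, 60],
})

# A small set of unique values suggests the feature behaves like a category.
print(fraud["NumberOfVehicles"].unique())

fraud = to_category(fraud, ["NumberOfVehicles", "BodilyInjuries"])
print(fraud.dtypes)
```

Columns not listed, such as InsuredAge here, keep their numeric type.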



## *************Incident Time Unique Values*************
## [17 10 22  7 20 18  3  5 14 16 15 13 12  9 19  4 11  1  8  0  6 21 23  2
##  -5]



IncidentTime has unique values that would warrant it becoming categorical, though the many levels would not be optimal for use in our modeling. We can remedy this by placing unique time values into bins using a Python dictionary. This will reduce the number of levels.

## ***Incident Period Day Value Counts***
## night              7458
## early afternoon    5785
## early morning      5580
## late morning       3661
## late afternoon     3231
## evening            2699
## Name: IncidentPeriodDay, dtype: int64



We find from the value count output for the new feature IncidentPeriodDay that incident times have been placed into six unique periods of the day.
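The dictionary-based binning can be sketched as below. The six period names come from the output above, but the hour boundaries are hypothetical, since the report does not list them. Hours absent from the dictionary (such as the value -5 seen in the unique values) map to a missing value, which may be why IncidentPeriodDay later shows fewer non-null entries than the other features.

```python
import pandas as pd

# Hypothetical hour ranges per period; the actual boundaries are not given in the report.
period_hours = {
    "early morning": range(5, 9),
    "late morning": range(9, 12),
    "early afternoon": range(12, 15),
    "late afternoon": range(15, 18),
    "evening": range(18, 21),
    "night": [21, 22, 23, 0, 1, 2, 3, 4],
}

# Invert to an hour -> period lookup so Series.map can apply it directly.
hour_to_period = {hour: period
                  for period, hours in period_hours.items()
                  for hour in hours}

incident_time = pd.Series([17, 10, 22, 7, -5, 0], name="IncidentTime")
incident_period = incident_time.map(hour_to_period).astype("category")

print(incident_period)
```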



Date features used in creating new features are no longer required and will be removed from the data set.

For purposes of classification algorithms and visualizations we’ll need to convert all categorical columns (Object Data Type) to the category data type. This will be accomplished by creating a function to identify non-numerical columns and converting them to the category data type.

## *****************fraud_v3 Information*****************
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 28836 entries, 0 to 28835
## Data columns (total 42 columns):
##  #   Column                      Non-Null Count  Dtype   
## ---  ------                      --------------  -----   
##  0   CustomerID                  28836 non-null  category
##  1   TypeOfIncident              28836 non-null  category
##  2   TypeOfCollission            28836 non-null  category
##  3   SeverityOfIncident          28836 non-null  category
##  4   AuthoritiesContacted        28836 non-null  category
##  5   IncidentState               28836 non-null  category
##  6   IncidentCity                28836 non-null  category
##  7   IncidentAddress             28836 non-null  category
##  8   NumberOfVehicles            28836 non-null  category
##  9   PropertyDamage              28836 non-null  category
##  10  BodilyInjuries              28836 non-null  category
##  11  Witnesses                   28836 non-null  category
##  12  PoliceReport                28836 non-null  category
##  13  AmountOfInjuryClaim         28836 non-null  int32   
##  14  AmountOfPropertyClaim       28836 non-null  int32   
##  15  AmountOfVehicleDamage       28836 non-null  int32   
##  16  AmountOfTotalClaim          28836 non-null  int32   
##  17  InsuredAge                  28836 non-null  int32   
##  18  InsuredZipCode              28836 non-null  int32   
##  19  InsuredGender               28836 non-null  category
##  20  InsuredEducationLevel       28836 non-null  category
##  21  InsuredOccupation           28836 non-null  category
##  22  InsuredHobbies              28836 non-null  category
##  23  CapitalGains                28836 non-null  int32   
##  24  CapitalLoss                 28836 non-null  int32   
##  25  Country                     28836 non-null  category
##  26  InsurancePolicyNumber       28836 non-null  int32   
##  27  CustomerLoyaltyPeriod       28836 non-null  int32   
##  28  InsurancePolicyState        28836 non-null  category
##  29  Policy_CombinedSingleLimit  28836 non-null  category
##  30  Policy_Deductible           28836 non-null  int32   
##  31  PolicyAnnualPremium         28836 non-null  float64 
##  32  UmbrellaLimit               28836 non-null  int32   
##  33  InsuredRelationship         28836 non-null  category
##  34  VehicleID                   28836 non-null  category
##  35  VehicleMake                 28836 non-null  category
##  36  VehicleModel                28836 non-null  category
##  37  VehicleYOM                  28836 non-null  category
##  38  ReportedFraud               28836 non-null  category
##  39  coverageIncidentDiff        28836 non-null  float64 
##  40  dayOfWeek                   28836 non-null  category
##  41  IncidentPeriodDay           28414 non-null  category
## dtypes: category(28), float64(2), int32(12)
## memory usage: 5.3 MB



From the above output we observe that all object data types are now type categorical.
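One way to write such a conversion function is sketched below. The helper name `non_numeric_to_category` and the toy values are assumptions. Selecting columns with `exclude="number"` would also pick up datetime or boolean columns if any were present, so the selection may need to be narrowed on other data.

```python
import pandas as pd

def non_numeric_to_category(df):
    """Return a copy of df with every non-numeric column cast to category."""
    non_numeric = df.select_dtypes(exclude="number").columns
    return df.astype({col: "category" for col in non_numeric})

# Toy frame (values are illustrative only).
fraud = pd.DataFrame({
    "IncidentState": ["State3", "State5", "State3"],
    "ReportedFraud": ["Y", "N", "N"],
    "InsuredAge": [35, 42, 51],
})

fraud_cat = non_numeric_to_category(fraud)
print(fraud_cat.dtypes)
```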



Figure 1











Figure 2






We observe from the cross table that the unknown type of collision (denoted by ‘?’) is associated with only a small number of the incident types related to collisions. These data points will be retained by renaming the unknown category to ‘none’.
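A cross table and the category rename can be sketched as follows. The toy rows are illustrative and do not reproduce the real counts; `pd.crosstab` and `cat.rename_categories` are assumed here as suitable tools, and the project's code may differ.

```python
import pandas as pd

# Toy rows (illustrative only); both columns use the category dtype.
fraud = pd.DataFrame({
    "TypeOfIncident": ["Parked Car", "Vehicle Theft", "Single Vehicle Collision",
                       "Multi-vehicle Collision", "Parked Car"],
    "TypeOfCollission": ["?", "?", "Front Collision", "Rear Collision", "?"],
}).astype("category")

# Count how often each collision type occurs within each incident type.
table = pd.crosstab(fraud["TypeOfIncident"], fraud["TypeOfCollission"])
print(table)

# Keep the rows but give the placeholder category a meaningful label.
fraud["TypeOfCollission"] = fraud["TypeOfCollission"].cat.rename_categories({"?": "none"})
print(fraud["TypeOfCollission"].cat.categories.tolist())
```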

Figure 3




Figure 4


From Figure 4 we detect certain features that must be dealt with due to missing values. First, the property damage feature will be dropped because many observations have no answer, which is denoted by a question mark.



Next, the category MISSINGVALUE from the Witnesses feature will be dropped.





Figure 5




Figure 6


Figure 6 informs us that there are additional categorical features which must be either cleaned or dropped. First, the feature Police Report has close to 10000 missing values (denoted by a question mark). This feature will be dropped.



The next feature requiring attention is InsuredGender. There are a small number of missing values, denoted by NA. This category will be removed from InsuredGender. Given its small count, omitting this category should have little effect on our models.





Figure 7




## *******premium_missing shape*******
## (141, 40)



## *******fraud_v6 shape*******
## (28836, 40)





Figure 8




VehicleMake has a small number of missing values (denoted by ‘???’). The category ‘???’ will be removed from the feature.












Figure 9






Figure 9 displays the VehicleMake feature with no missing values.
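Removing a placeholder category from a categorical feature can be sketched with `cat.remove_categories`. This is an assumption about how the step might be done, shown on toy values. In pandas, values belonging to a removed category become missing (NaN) while the rows themselves stay in the data set.

```python
import pandas as pd

# Toy categorical feature with a placeholder level (values are illustrative only).
vehicle_make = pd.Series(["Audi", "???", "Toyota", "Audi"],
                         dtype="category", name="VehicleMake")

# Remove the placeholder level; affected entries become NaN, rows are kept.
cleaned = vehicle_make.cat.remove_categories(["???"])

print(cleaned.cat.categories.tolist())
print(cleaned.isna().sum())
```

If the affected rows should be discarded as well, `dropna` on that column would be a follow-up step.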



Filtering for any PolicyAnnualPremium value that is equal to -1 we find 141 values returned. From the Attribute Information pdf provided with the data set we know that -1 represents a missing value. All observations with -1 will be removed.







From the size output we can observe all values of -1 have been removed.
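The filter-and-remove step for the -1 placeholder can be sketched as below. The frame names follow the output labels above where possible (`premium_missing`); `fraud_clean` and the toy values are hypothetical.

```python
import pandas as pd

# Toy frame; -1 marks a missing premium, per the Attribute Information file.
fraud = pd.DataFrame({
    "CustomerID": ["Cust1", "Cust2", "Cust3", "Cust4"],
    "PolicyAnnualPremium": [1266.5, -1.0, 1310.2, -1.0],
})

# Rows flagged as missing, kept separately for inspection.
premium_missing = fraud[fraud["PolicyAnnualPremium"] == -1]

# Everything else becomes the cleaned data set.
fraud_clean = fraud[fraud["PolicyAnnualPremium"] != -1].reset_index(drop=True)

print(premium_missing.shape, fraud_clean.shape)
```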





Certain visualizations require numeric only data. We’ll create a data set that contains only numeric data types.

## ******************Numeric Data Types******************
## AmountOfInjuryClaim        int32
## AmountOfPropertyClaim      int32
## AmountOfVehicleDamage      int32
## AmountOfTotalClaim         int32
## InsuredAge                 int32
## CapitalGains               int32
## CapitalLoss                int32
## CustomerLoyaltyPeriod      int32
## Policy_Deductible          int32
## PolicyAnnualPremium      float64
## UmbrellaLimit              int32
## coverageIncidentDiff     float64
## dtype: object



The data set numeric_data only has features of numeric data types as seen from the above output.
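A numeric-only subset like this can be obtained with `DataFrame.select_dtypes`, sketched here on a toy frame. The sample values (including the gender labels) are illustrative; whether the project used this exact call is an assumption.

```python
import pandas as pd

# Toy frame mixing numeric, categorical, and string columns (illustrative values).
fraud = pd.DataFrame({
    "InsuredAge": [35, 42],
    "PolicyAnnualPremium": [1266.5, 1310.2],
    "InsuredGender": pd.Categorical(["MALE", "FEMALE"]),
    "ReportedFraud": ["N", "Y"],
})

# Keep only integer and float columns.
numeric_data = fraud.select_dtypes(include="number")

print(numeric_data.dtypes)
```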



Figure 10






There is very high to high correlation between Amount of Injury Claim, Amount of Property Claim, Amount of Vehicle Damage, and Amount of Total Claim. This is unsurprising as Amount of Total Claim is the sum of the other three. Amount of Total Claim is the only feature of the four that will be used for our machine learning models.
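Why a total correlates so strongly with its parts can be seen in a small sketch. The toy amounts are illustrative, and the total is built as the sum of the three components, matching the relationship described above.

```python
import pandas as pd

# Toy claim components (illustrative values only).
claims = pd.DataFrame({
    "AmountOfInjuryClaim": [6000, 1200, 9000, 300],
    "AmountOfPropertyClaim": [5500, 1000, 8000, 450],
    "AmountOfVehicleDamage": [40000, 9000, 61000, 2500],
})

# The total is the row-wise sum of the three components.
claims["AmountOfTotalClaim"] = claims.sum(axis=1)

# Pearson correlation matrix of the four features.
corr = claims.corr()
print(corr.round(3))
```

Keeping only the total avoids feeding the models several near-duplicate signals.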

Loyalty period and Age also exhibit very high correlation. This makes sense, as older customers have had more time to accrue loyalty than younger customers. Still, we will retain both features for our models.

Features not important for visualizing or building models will be dropped.



## **fraud_v8 shape**
## (28695, 33)



Visualization







Figure 11






Based on the subplots from Figure 11, we observe that certain numeric features have outliers. We’ll take a closer look at those features.



Figure 13




The box plots from Figure 13 displaying Amount of Total Claim for the two events of ReportedFraud are interesting. For ReportedFraud=Y there are many outliers below 22000. Data points falling under 22000 for ReportedFraud=N are not considered outliers.

We’ll further check the outliers by viewing the histograms from Figure 13. What jumps out is that the distribution of AmountOfTotalClaim has two distinct peaks; that is, it is bimodal. The value that occurs most often in a data set is its mode, and a peak in a distribution marks such a value. Here the two peaks show two separate ranges of total claim that occur most frequently.

The second histogram from Figure 13 superimposes the two events. The superimposed histogram follows the same bimodal distribution as the single histogram. The outliers are of no concern and will not be removed from the data set.







Figure 14






The box plots from Figure 14 show outliers above the age of 60 for both reported fraud events. In addition, both histograms from Figure 14 have slight skews to the right. Looking closer at the subplots, this appears to be due to drivers over the age of 50. Drivers over the age of 50 or 60 seeking auto insurance coverage is not unusual. Since the outliers are not unusual, they will not be removed or transformed.





Figure 15






The box plots from Figure 15 both have outliers at the lower and higher ends. It’s difficult to determine from the first histogram whether there is a skew (tail). The mean of 1261 is slightly less than the median of 1266, which tells us there is a small skew to the left: a few small values of Policy Annual Premium are driving the mean down. The third plot is of two histograms superimposed based on the Reported Fraud event. Reported Fraud=Y skews slightly to the left, which is supported by its mean of 1255 being less than its median of 1271. The histogram for Reported Fraud=N appears approximately normally distributed. Its mean and median are the same at 1263, which is consistent with a symmetric distribution, although equal mean and median alone do not confirm normality.

Based on these summary statistics, the skew of the first histogram is primarily caused by lower premiums among data points reported as fraud. This data can be important during model building. We’ll look at additional data to determine whether to keep these outliers.
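The mean-versus-median comparison used above can be sketched per fraud group. The toy premiums are illustrative and chosen so that one group has a low value pulling its mean down. A mean below the median points to a left skew, and a mean close to the median points to symmetry; a formal normality check would need more than this comparison.

```python
import pandas as pd

# Toy premiums (illustrative only): group Y has one low value, group N is symmetric.
premiums = pd.DataFrame({
    "ReportedFraud": ["Y", "Y", "Y", "N", "N", "N"],
    "PolicyAnnualPremium": [600.0, 1270.0, 1300.0, 1200.0, 1263.0, 1326.0],
})

# Mean, median, and sample skewness for each fraud group.
summary = (
    premiums
    .groupby("ReportedFraud")["PolicyAnnualPremium"]
    .agg(["mean", "median", "skew"])
)

print(summary.round(2))
```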





## **** Year Of Make****
## 2015     416
## 1995     531
## 1996     828
## 2014     871
## 1997    1131
## 2013    1256
## 1998    1276
## 2012    1308
## 2001    1428
## 1999    1479
## 2011    1518
## 2000    1523
## 2002    1527
## 2003    1571
## 2008    1622
## 2009    1623
## 2010    1631
## 2005    1635
## 2006    1637
## 2004    1661
## 2007    1709
## Name: VehicleYOM, dtype: int64



Figure 16




Auto insurance premiums are generally based on details such as choice of coverage, type of vehicle driven, and age of the car. Typically, the newer the car, the more expensive the insurance, because of the vehicle’s replacement cost. The year the car was manufactured plays just as big a part in the premium as the make and model itself. Figure 16 displays all years of vehicle make in our data set. We find that there are over 6000 autos that are 15 years old or older relative to the latest year of 2015. The number of older vehicles explains the outliers at the lower end of the box plots from Figure 15. These outliers are not unusual and thus will not be removed or transformed.



Figure 17




The plots from Figure 17 are unusual. Both box plots have a median of zero. Reported Fraud=Y has a mean of 1,000,000 while the Reported Fraud=N mean is 918,000. Both histograms have their peak at zero and a long tail to the right.

There are only 7506 data points greater than zero: 2417 for Yes and 5089 for No. Data points greater than zero represent only 26 percent of the entire data set. Normally this would seem unusual, and we would review the raw data for errors. Checking the description of umbrella limit, we find that such extreme data points are not uncommon. Umbrella insurance provides “excess liability insurance” beyond the liability insurance already in auto insurance coverage. It’s for expensive situations where medical bills and/or repairs exceed those covered by “base” auto policies. Auto policy holders in the higher income brackets are usually the purchasers of umbrella coverage. Thus, for all data points, the mean of 972,000 and max of 10,000,000 are plausible values. Additionally, the median of zero is not surprising, as not many insured choose umbrella limits for their policies. Due to the small percentage of insured with umbrella limits, we will exclude this feature from our models.





Figure 18






Figure 18 displays bar plots of the categories of the feature ‘severity of incident’, stacked by whether reported fraud is ‘Y’ or ‘N’. ‘Major Damage’ stands out: 60% of its claims are reported as fraud, whereas every other category has under 16% of its claims reported as fraud.







Figure 19


Figure 20








Figure 21










Figure 19 presents each vehicle make with the percentages of claims reported as fraud “Y” and “N”. Volkswagen, Mercedes, Ford, BMW, and Audi are the vehicle makes with reported fraud over 30%. This is an interesting statistic, though due to the large number of categories we’ll explore the ‘Vehicle Make’ feature further.



Box plots from Figure 20 show that the median total claim is roughly the same for all vehicle makes.



We find from Figure 21 that Nissan, Subaru, and Toyota have a median capital gain near 20,000, substantially larger than all other makes. The vehicle makes with over 30% reported fraud in Figure 19 all have medians of zero.

Due to the number of categories of “Vehicle Make” we will exclude it from the modeling process.







Figure 22


Figure 23




Figure 24






From Figure 22 we find four incident states in which over 30% of claims are reported as fraud. State3 has over 40% of its claims reported as fraud.

Figure 23 and Figure 24 do not exhibit differences among incident states with regard to total claims and capital gains.

We’ll retain the ‘Incident State’ feature as it has half as many categories as ‘Vehicle Make’.





Figure 25




From Figure 25 two categories stand out with respect to reported fraud. ‘Single Vehicle Collision’ and ‘Multi-vehicle collision’ from the feature ‘Type of Incident’ have claims reported as fraud at 31% and 29% respectively. The other two categories are under 14%.





Data Preprocessing



Before model building can start, we’ll need to perform pre-processing. This entails splitting our data into training, validation, and test sets and transforming numerical and categorical features into classification-friendly formats.



Select features

## The  Target categories: Index(['N', 'Y'], dtype='object'):

Split Dataset



We will separate the predictor features and the target feature into separate data frames.



The data type of the target feature is categorical. Most machine learning algorithms require numerical data types, so the target feature y will be transformed to a numeric type.







## dtype('int64')



From the output above we can see that the target feature has been converted into binary form, though its data type is integer. The data type must be converted back to categorical.
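The round trip described here can be sketched roughly as follows. This is a minimal illustration on a four-row stand-in frame, assuming a simple `map` from the labels to 0/1; the report does not show which encoder was actually used, so the mapping approach and the toy data are assumptions.

```python
import pandas as pd

# Toy stand-in for the target frame (hypothetical rows, real column name).
y = pd.DataFrame({"ReportedFraud": pd.Categorical(["N", "Y", "N", "N"])})

# Map the string labels to 0/1; the result is a plain integer column.
y["ReportedFraud"] = y["ReportedFraud"].astype(str).map({"N": 0, "Y": 1}).astype("int64")
print(y["ReportedFraud"].dtype)

# Convert back to a categorical column whose categories are now 0 and 1.
y["ReportedFraud"] = y["ReportedFraud"].astype("category")
print(list(y["ReportedFraud"].cat.categories))
```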



## Target Feature categories as binary:  Int64Index([0, 1], dtype='int64'):



## Target Feature as categorical binary:
## 
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 28181 entries, 0 to 28835
## Data columns (total 1 columns):
##  #   Column         Non-Null Count  Dtype   
## ---  ------         --------------  -----   
##  0   ReportedFraud  28181 non-null  category
## dtypes: category(1)
## memory usage: 247.8 KB



## Shape of Predictor Features is (28181, 23):



## Shape of Target Feature is (28181, 1):



The X data frame is made up of 28181 rows and 23 columns. The y data frame has the same number of rows, 28181, and one column, the target feature.





## *********** X Structure***********
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 28181 entries, 0 to 28835
## Data columns (total 23 columns):
##  #   Column                      Non-Null Count  Dtype   
## ---  ------                      --------------  -----   
##  0   TypeOfIncident              28181 non-null  category
##  1   TypeOfCollission            28181 non-null  category
##  2   SeverityOfIncident          28181 non-null  category
##  3   AuthoritiesContacted        28181 non-null  category
##  4   IncidentState               28181 non-null  category
##  5   NumberOfVehicles            28181 non-null  category
##  6   BodilyInjuries              28181 non-null  category
##  7   Witnesses                   28181 non-null  category
##  8   AmountOfTotalClaim          28181 non-null  int32   
##  9   InsuredAge                  28181 non-null  int32   
##  10  InsuredGender               28181 non-null  category
##  11  CapitalGains                28181 non-null  int32   
##  12  CapitalLoss                 28181 non-null  int32   
##  13  CustomerLoyaltyPeriod       28181 non-null  int32   
##  14  InsurancePolicyState        28181 non-null  category
##  15  Policy_CombinedSingleLimit  28181 non-null  category
##  16  Policy_Deductible           28181 non-null  int32   
##  17  PolicyAnnualPremium         28181 non-null  float64 
##  18  UmbrellaLimit               28181 non-null  int32   
##  19  InsuredRelationship         28181 non-null  category
##  20  coverageIncidentDiff        28181 non-null  float64 
##  21  dayOfWeek                   28181 non-null  category
##  22  IncidentPeriodDay           28181 non-null  category
## dtypes: category(14), float64(2), int32(7)
## memory usage: 1.8 MB



We will now split X and y into train, validation, and test sets.







## Shape of X Train (19726, 23):
## Shape of X Valid (4227, 23):
## Shape of X Test (4228, 23):
## Shape of y Train (19726, 1):
## Shape of y Valid (28181, 1):
## Shape of y Test (4228, 1):
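The train, validation, and test row counts above (19726, 4227, and 4228 of 28181) correspond to roughly a 70/15/15 split. A minimal sketch of one way to obtain such a split with scikit-learn’s `train_test_split` is shown below. The synthetic data, the use of `stratify`, and the `random_state` values are assumptions for illustration and are not taken from the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictor and target frames (hypothetical data).
rng = np.random.default_rng(0)
X = pd.DataFrame({"AmountOfTotalClaim": rng.integers(100, 100_000, size=1000)})
y = pd.DataFrame({"ReportedFraud": rng.integers(0, 2, size=1000)})

# Step 1: hold out 30% of the rows, keeping the class balance (assumed choice).
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, stratify=y["ReportedFraud"], random_state=0
)

# Step 2: split the holdout in half to get validation and test sets.
X_valid, X_test, y_valid, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, stratify=y_hold["ReportedFraud"], random_state=0
)

print(X_train.shape, X_valid.shape, X_test.shape)
print(y_train.shape, y_valid.shape, y_test.shape)
```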





The y data frames will be transformed to numpy arrays.



## Shape of y Train rv (19726, 1):
## Shape of y Valid rv (28181, 1):
## Shape of y Test rv (4228, 1):



Next, we will transform the y arrays to one-dimensional arrays.





## Shape of y Train rv (19726,):
## Shape of y Valid rv (4227,):
## Shape of y Test rv (4228,):



From the above output we see that y train, y valid, and y test have been transformed into one dimensional numpy arrays.
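The two steps (data frame to array, then array to one dimension) can be sketched as below, assuming pandas’ `to_numpy` and numpy’s `ravel`. The variable names and the five-row frame are hypothetical.

```python
import pandas as pd

# Toy stand-in for a single-column target frame (hypothetical values).
y_train = pd.DataFrame({"ReportedFraud": [0, 1, 0, 0, 1]})

y_train_rv = y_train.to_numpy()   # 2-D array, one column
y_train_1d = y_train_rv.ravel()   # flattened to 1-D

print(y_train_rv.shape, y_train_1d.shape)
```

Many scikit-learn estimators accept a column-shaped y but emit a warning, which is one common reason to flatten it first.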





Transform Categorical and Numerical features



Our next step is to transform the predictor features into acceptable machine learning formats.

Numerical features are transformed by scaling. Scaling prevents a feature with a range in, say, the thousands from being treated as more important than a feature with a smaller range. It puts features on the same footing before they are passed to a machine learning algorithm. There are different methods for scaling features; for this analysis we’ll use standard scaling. Standard scaling transforms the data to have zero mean and a variance of one, thus making the data unitless.

Most machine learning algorithms only accept numerical features which makes categorical features unacceptable in their original form. Thus, we need to encode categorical features into numerical values. The act of replacing categories with numbers is called categorical encoding. For this we will use one-hot encoding. Categorical features are represented as a group of binary features, where each binary feature represents one category. The binary feature takes the integer value 1 if the category is present, or 0 otherwise.





First, we will create transformed train, valid, and test sets for the Logistic Regression model. This entails dropping the first category of each feature during One Hot Encoding.





## ************First Five Rows X_train_lr************
##        scale__AmountOfTotalClaim  ...  ohe__IncidentPeriodDay_night
## 10858                  -1.839897  ...                           0.0
## 10084                   1.167314  ...                           0.0
## 1826                   -1.842925  ...                           0.0
## 27377                   1.142850  ...                           0.0
## 19587                  -1.822924  ...                           0.0
## 
## [5 rows x 63 columns]







## ************First Five Rows X_valid_lr************
##        scale__AmountOfTotalClaim  ...  ohe__IncidentPeriodDay_night
## 16124                  -0.628661  ...                           1.0
## 17479                   0.521933  ...                           0.0
## 6203                   -0.063166  ...                           0.0
## 14876                   0.921840  ...                           0.0
## 493                     0.549704  ...                           0.0
## 
## [5 rows x 63 columns]





## ************First Five Rows X_test_lr************
##        scale__AmountOfTotalClaim  ...  ohe__IncidentPeriodDay_night
## 9540                    0.812948  ...                           0.0
## 17332                   1.046629  ...                           0.0
## 13547                  -0.089223  ...                           1.0
## 21248                   0.672341  ...                           1.0
## 10040                   1.560886  ...                           0.0
## 
## [5 rows x 63 columns]





We see from the first five rows of the train, valid, and test sets that the features have been transformed while at the same time retaining the column feature names.



## Shape of X Train lr (19726, 63):
## Shape of X Valid lr (4227, 63):
## Shape of X Test lr (4228, 63):





Next, we transform training, valid, and test sets for all other models. During OneHotEncoding, the first category will be dropped only if the feature is binary.



Transform X_train



## ************First Five Rows X_train_tr************
##        num__AmountOfTotalClaim  ...  cat__IncidentPeriodDay_night
## 10858                -1.839897  ...                           0.0
## 10084                 1.167314  ...                           0.0
## 1826                 -1.842925  ...                           0.0
## 27377                 1.142850  ...                           0.0
## 19587                -1.822924  ...                           0.0
## 
## [5 rows x 76 columns]







## ************First Five Rows X_valid_tr************
##        num__AmountOfTotalClaim  ...  cat__IncidentPeriodDay_night
## 16124                -0.628661  ...                           1.0
## 17479                 0.521933  ...                           0.0
## 6203                 -0.063166  ...                           0.0
## 14876                 0.921840  ...                           0.0
## 493                   0.549704  ...                           0.0
## 
## [5 rows x 76 columns]



## ************First Five Rows X_test_tr************
##        num__AmountOfTotalClaim  ...  cat__IncidentPeriodDay_night
## 9540                  0.812948  ...                           0.0
## 17332                 1.046629  ...                           0.0
## 13547                -0.089223  ...                           1.0
## 21248                 0.672341  ...                           1.0
## 10040                 1.560886  ...                           0.0
## 
## [5 rows x 76 columns]



## Shape of X Train tr (19726, 76):
## Shape of X Valid tr (4227, 76):
## Shape of X Test tr (4228, 76):



From the shape output we find there are 13 additional columns compared to the logistic regression transformed data.





Models



For the purpose of evaluating model performance, the event of interest is reported fraud being yes. This is considered the positive class. Classification metrics are used to determine how well our models predict the event of interest.



Metrics Definitions

Accuracy - measures the number of correct predictions as a percentage of the total number of predictions made. As an example, if 90% of your predictions are correct, your accuracy is simply 90%. Calculation: number of correct predictions / number of total predictions, or (TP+TN)/(TP+TN+FP+FN).

Precision - tells us about the quality of positive predictions. A high-precision model may not find all the positives, but the ones it does classify as positive are very likely to be correct. As an example, out of everyone predicted to have defaulted, how many actually did default? So within everything that has been predicted as positive, precision is the percentage that is correct. Calculation: true positives / all predicted positives, or TP/(TP+FP).

Recall - tells us how well the model identifies actual positives. A high-recall model may find most of the positives, yet it may also wrongly flag many cases that are not actually positive. As an example, out of all the patients who have the disease, how many were correctly identified? So within everything that actually is positive, recall is the percentage the model successfully found. A model with low recall is not able to find all (or a large part) of the positive cases in the data. Calculation: true positives / (true positives + false negatives), or TP/(TP+FN).

F1 Score - defined as the harmonic mean of precision and recall.

The harmonic mean is an alternative to the more common arithmetic mean. It is often useful when computing an average rate. https://en.wikipedia.org/wiki/Harmonic_mean

The formula for the F1 score is: F1 = 2 × (Precision × Recall) / (Precision + Recall)

Since the F1 score is an average of precision and recall, it gives equal weight to both.
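As a worked example, the four metrics can be computed by hand from confusion-matrix counts. The sketch below uses the counts this report later quotes for the extreme gradient boosting model on the validation set (912 true positives, 147 false positives) together with the class supports from its classification report (1130 fraud, 3097 non-fraud); the variable names are illustrative.

```python
# Counts for the extreme gradient boosting model on the validation set.
tp, fp = 912, 147
fn = 1130 - tp   # fraudulent claims the model missed
tn = 3097 - fp   # non-fraudulent claims correctly classified

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.91 0.86 0.81 0.83
```

These values agree with the class 1 row of the XGBoost validation classification report.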





Model Training and Validation



Logistic Regression

Parameters



LogisticRegression(C=1, max_iter=10000, solver='newton-cg')



The above display gives us the parameters chosen for the logistic regression model.



Validation Metrics

## ****Logistic RegressionValidation Classification Report****
##               precision    recall  f1-score   support
## 
##            0       0.83      0.90      0.86      3097
##            1       0.64      0.50      0.56      1130
## 
##     accuracy                           0.79      4227
##    macro avg       0.73      0.70      0.71      4227
## weighted avg       0.78      0.79      0.78      4227
Figure 26




Feature Importance



We will now look at feature importance. Feature importance is a score assigned to each feature of a machine learning model that indicates how “important” the feature is to the model’s predictions.



For the logistic regression model we took the absolute value of the coefficients so as to capture the importance of features with both negative and positive effects.



Now that we have the importance of the features, we will transform the coefficients for easier interpretation. The coefficients are in log-odds format. We will transform them to odds-ratio format.





## ******************Top Five Coefficients******************
##                                     Feature  Exp_Coefficient
## 50       ohe__InsuredRelationship_unmarried         1.634025
## 48  ohe__InsuredRelationship_other-relative         1.521623
## 34                         ohe__Witnesses_2         1.503439
## 47   ohe__InsuredRelationship_not-in-family         1.467964
## 58     ohe__IncidentPeriodDay_early morning         1.353613
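The conversion from log-odds to odds ratios is an exponentiation of each coefficient. Below is a minimal sketch on synthetic data, assuming a fitted scikit-learn `LogisticRegression` with the parameters displayed earlier; the feature names `f0` to `f2` and the data are hypothetical, and only the column name `Exp_Coefficient` is taken from the table above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data in which f0 clearly raises the odds of the positive class.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f0", "f1", "f2"])
y = (X["f0"] + 0.5 * X["f1"] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression(C=1, max_iter=10000, solver="newton-cg").fit(X, y)

coef = pd.DataFrame({"Feature": X.columns, "Coefficient": model.coef_[0]})
coef["Importance"] = coef["Coefficient"].abs()          # magnitude of the effect
coef["Exp_Coefficient"] = np.exp(coef["Coefficient"])   # log-odds -> odds ratio

print(coef.sort_values("Exp_Coefficient", ascending=False))
```

Read this way, the value 1.634 for ohe__InsuredRelationship_unmarried in the table above says the odds of reported fraud are about 1.63 times those of the dropped reference category, holding the other features fixed.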





Support Vector Machine



A Support Vector Machine (sometimes pictured as a “road machine” that lays the widest possible road between the classes) finds the decision boundary that separates the classes while maximizing the margin between them. Data points falling on opposite sides of the decision boundary are assigned to different classes. For a binary problem, the classes would be yes or no.



Training





SVC(C=1, gamma=0.1, random_state=0)



The above display gives us the parameters chosen for the support vector machine model.
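A minimal sketch of fitting a classifier with these parameters and producing a validation report is given below. It runs on synthetic data from `make_classification`, not the insurance data, and the split is an assumption; only `C`, `gamma`, and `random_state` come from the display above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical, not the insurance data).
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Parameters as displayed above; the kernel stays at scikit-learn's default (RBF).
svc = SVC(C=1, gamma=0.1, random_state=0).fit(X_train, y_train)

report = classification_report(y_valid, svc.predict(X_valid))
print(report)
```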



Validation Metrics





## **********SVC Validation Classification Report**********
##               precision    recall  f1-score   support
## 
##            0       0.92      0.98      0.95      3097
##            1       0.93      0.76      0.84      1130
## 
##     accuracy                           0.92      4227
##    macro avg       0.92      0.87      0.89      4227
## weighted avg       0.92      0.92      0.92      4227







Decision Tree



Decision trees use a flowchart-like tree structure to show the predictions that result from a series of feature splits. To accomplish this, a decision tree is made up of three types of nodes:

  • Root Node (parent node): The node that starts the graph. It evaluates the variable that best splits the data.

  • Intermediate Nodes (child nodes): These are nodes where features are evaluated for further splits of the data but are not the final nodes.

  • Leaf nodes (terminal nodes): These are the final nodes of the tree, where the prediction of a categorical event is made.

For a more detailed explanation of decision trees check the link below.

Guide to Decision Trees



Training



Parameter Definitions:

min_samples_split - Defines the minimum number of samples (or observations) which are required in a node to be considered for splitting.

min_samples_leaf - The minimum number of samples (or observations) required to be at a terminal node or leaf. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.

max_depth - The maximum depth of the tree, which is the same as the maximum number of splitting rounds along any path.

max_features - The number of features to consider when looking for the best split.
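To see these parameters act as constraints, the sketch below fits a small tree on synthetic data and inspects the result. The parameter values here are chosen for illustration and differ from the ones used in this analysis; `tree_.children_left == -1` is scikit-learn’s marker for leaf nodes.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=4,           # at most 4 rounds of splitting along any path
    min_samples_split=20,  # a node needs at least 20 samples to be split
    min_samples_leaf=5,    # every leaf must keep at least 5 samples
    max_features=0.5,      # consider half of the features at each split
    random_state=0,
).fit(X, y)

t = tree.tree_
is_leaf = t.children_left == -1
smallest_leaf = int(t.n_node_samples[is_leaf].min())

print("depth:", tree.get_depth())
print("leaves:", tree.get_n_leaves())
print("smallest leaf size:", smallest_leaf)
```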



DecisionTreeClassifier(max_depth=1000, max_features=0.5, min_samples_leaf=5)



The above display gives us the parameters chosen for the Decision Tree model.





Validation Metrics



## ****Decision Tree Classification Report****
##               precision    recall  f1-score   support
## 
##            0       0.73      0.74      0.74      3097
##            1       0.27      0.26      0.26      1130
## 
##     accuracy                           0.61      4227
##    macro avg       0.50      0.50      0.50      4227
## weighted avg       0.61      0.61      0.61      4227
Figure 27



Feature Importance





Figure 28






Random Forest



Random forest is an ensemble learning method. Ensemble learning takes predictions from multiple models and merges them to enhance the accuracy of prediction. There are four types of ensemble techniques. We’ll be using bagging (of which random forest is an example) and boosting, of which our next four models are examples.

Bagging involves fitting many decision trees on different samples of the same dataset and averaging the predictions.

Random Forest models are made up of individual decision trees whose predictions are combined for a final result. The final result is decided using majority rules which means that the final prediction is what the majority of the decision tree models chose. Random Forests can be made up of thousands of decision trees.

Simply put, the random forest builds multiple decision trees and merges them together to get a more accurate prediction.

Random Forest for Beginners





As an example of majority rule, consider a forest of 5 trees in which 3 of the 5 predict ‘yes’ for the classification problem; the Random Forest Classifier then predicts ‘yes’.
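Majority voting can be sketched in a few lines of numpy. The predictions below are hypothetical. Note also that scikit-learn’s `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so this is an illustration of the idea and not of the library’s internals.

```python
import numpy as np

# Hypothetical 0/1 predictions from five trees (rows) for four claims (columns).
tree_preds = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
])

n_trees = tree_preds.shape[0]
votes_for_yes = tree_preds.sum(axis=0)               # 'yes' votes per claim
forest_pred = (votes_for_yes > n_trees / 2).astype(int)

print(votes_for_yes)  # [3 1 4 1]
print(forest_pred)    # [1 0 1 0]
```

The first claim is the 3-of-5 case: three trees vote ‘yes’, so the forest predicts ‘yes’.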



Training



Parameter Definitions (Not previously defined):

n_estimators - the number of trees to construct



RandomForestClassifier(max_depth=1000, max_features=0.5, min_samples_leaf=3,
                       min_samples_split=6, n_estimators=5000, n_jobs=-1)



The above display presents the parameters chosen for the random forest classifier.





Validation Metrics



## ****Random Forest Validation Classification Report****
##               precision    recall  f1-score   support
## 
##            0       0.90      0.97      0.93      3097
##            1       0.91      0.70      0.79      1130
## 
##     accuracy                           0.90      4227
##    macro avg       0.90      0.84      0.86      4227
## weighted avg       0.90      0.90      0.90      4227
Figure 29






Feature Importance





Figure 30




AdaBoost



Boosting learns from the mistakes of individual trees. Each new tree is built on the previous tree. We’ll be using four boosting algorithms, the first being AdaBoost.

In AdaBoost, each new tree adjusts for the errors of the previous tree through observation weights. Every observation has an assigned weight, and trees are built in an additive manner, assigning greater weights (more importance) to the observations misclassified by the previous learners. A misclassified observation is, for example, one predicted yes when the actual value is no. The individual trees are weak learners, meaning each performs only slightly better than a random guess on its own.
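One round of this reweighting can be sketched as follows. It follows the textbook discrete AdaBoost update for a binary problem; the labels and first-tree predictions are hypothetical, and scikit-learn’s implementation may differ in detail.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # hypothetical first-tree predictions

# Every observation starts with the same weight.
w = np.full(len(y_true), 1 / len(y_true))
miss = y_pred != y_true

err = w[miss].sum()               # weighted error of this tree (2 of 8 wrong)
alpha = np.log((1 - err) / err)   # how much say this tree gets in the ensemble

# Increase the weight of misclassified observations, then renormalize.
w = w * np.exp(alpha * miss)
w = w / w.sum()

print(round(err, 3), round(alpha, 3))
print(w.round(4))
```

After the update, the two misclassified observations carry a weight of 0.25 each while the six correctly classified ones carry about 0.083 each, so the next tree concentrates on the earlier mistakes.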



Training



Parameter Definitions (Not previously defined)

Learning rate - Shrinks the contribution of individual trees in each round of boosting so that no tree has too much influence. With a lower learning rate, more trees are required to produce good scores. Lowering the learning rate helps prevent overfitting because the size of the weights carried forward is smaller.









AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=6),
                   learning_rate=1, n_estimators=5000, random_state=1)



The above display presents the parameters chosen for the AdaBoost model.



Validation Metrics





## ****AdaBoost Validation Classification Report****
##               precision    recall  f1-score   support
## 
##            0       0.91      0.98      0.94      3097
##            1       0.92      0.72      0.81      1130
## 
##     accuracy                           0.91      4227
##    macro avg       0.91      0.85      0.87      4227
## weighted avg       0.91      0.91      0.90      4227
Figure 31






Feature Importance



Figure 32




Gradient Boosting



Gradient boosting also uses incorrect predictions from previous trees to adjust the next tree, though it does so by fitting each new tree to the errors of the previous trees’ predictions. Each new tree is built solely around these mistakes. As with AdaBoost, gradient boosting combines weak learners into a strong learner. The difference is that the gradient boosting algorithm fits each new tree only to the errors of the previous trees, whereas AdaBoost reweights every observation.

The main idea behind this algorithm is to build models sequentially and these subsequent models try to reduce the errors of the previous model. Errors are reduced by building a new model on the errors or residuals of the previous model.
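The residual-fitting loop can be sketched as below. For simplicity this is a regression example with squared error, where the residuals are simply the differences between targets and current predictions; `GradientBoostingClassifier` applies the same idea to the gradient of a classification loss. The data, tree depth, number of rounds, and learning rate are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
pred = np.full_like(y, y.mean())          # initial prediction: the mean
mse_start = np.mean((y - pred) ** 2)

for _ in range(50):
    residuals = y - pred                  # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    pred += learning_rate * tree.predict(X)   # shrunken correction

mse_end = np.mean((y - pred) ** 2)
print(round(mse_start, 4), round(mse_end, 4))
```

Each round adds a small correction aimed at what the ensemble still gets wrong, which is why the training error falls as trees accumulate and why a smaller learning rate needs more trees.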



Training



parameter definitions (not previously defined)

Criterion - The loss function used to find the optimal feature and threshold to split the data

base learner - The initial decision tree. It’s the first learner in the process.

Subsample - A subset of samples. A subset of rows means not all rows may be included when building each tree. The percentage of rows used in each boosting round is limited.

max_leaf_nodes - The maximum number of terminal nodes or leaves in a tree.

n_iter_no_change - is used to decide if early stopping will be used to terminate training when validation score is not improving.

tol - Tolerance for the early stopping. When the loss is not improving by at least tol for n_iter_no_change iterations (if set to a number), the training stops.

ccp_alpha - Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen









GradientBoostingClassifier(max_depth=15, max_features=9, min_samples_leaf=60,
                           min_samples_split=1000, n_estimators=4000,
                           subsample=0.7, warm_start=True)



The above display presents the parameters chosen for the gradient boost model.



Validation Metrics





## ****Gradient Boosting Classification Report****
##               precision    recall  f1-score   support
## 
##            0       0.92      0.97      0.94      3097
##            1       0.89      0.75      0.82      1130
## 
##     accuracy                           0.91      4227
##    macro avg       0.90      0.86      0.88      4227
## weighted avg       0.91      0.91      0.91      4227
Figure 33






Feature Importance





Figure 34






Extreme Gradient Boosting



Extreme gradient boosting is similar to gradient boosting with a few improvements. First, performance enhancements make it faster than many other ensemble methods. Second, built-in regularization can give it an advantage in accuracy. Regularization is the process of adding information to reduce variance and prevent overfitting.





Training



Parameter Definitions (Not previously defined):









XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.3, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=0.1, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.01, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=10, max_leaves=None,
              min_child_weight=5, missing=nan, monotone_constraints=None,
              n_estimators=5000, n_jobs=-1, num_parallel_tree=None,
              predictor=None, random_state=None, ...)



The above display presents the parameters chosen for the XGBoost model.





Validation Metrics



## ****XGBoost Validation Report Classification Report****
##               precision    recall  f1-score   support
## 
##            0       0.93      0.95      0.94      3097
##            1       0.86      0.81      0.83      1130
## 
##     accuracy                           0.91      4227
##    macro avg       0.90      0.88      0.89      4227
## weighted avg       0.91      0.91      0.91      4227
Figure 35






Feature Importance





Figure 36




Light GBM



Light GBM grows trees vertically (leaf-wise), while other algorithms grow trees horizontally (level-wise). It chooses the leaf with the maximum delta (change in) loss to grow.





parameter definition (not previously defined)



Training



LGBMClassifier(colsample_bytree=0.7, learning_rate=0.01, max_depth=6,
               metric='None', min_child_samples=1, n_estimators=5000,
               random_state=314, subsample=0.8)



The above display presents the parameters chosen for the light gradient boost model.



Validation Metrics





## ****Light Gradient Boosting Classification Report****
##               precision    recall  f1-score   support
## 
##            0       0.91      0.98      0.94      3097
##            1       0.93      0.74      0.82      1130
## 
##     accuracy                           0.92      4227
##    macro avg       0.92      0.86      0.88      4227
## weighted avg       0.92      0.92      0.91      4227





Feature Importance





Figure 37




Validation Score Comparison









Figure 39






Figure 40




From the F1 scores above we find that the logistic regression and decision tree models are below average in predicting the event of interest. The other models have F1 scores that suggest above average capability in predicting the event of interest.

We’ll look deeper into the scores by evaluating precision and recall. Precision is the metric we’ll focus on, since the higher the precision, the fewer false positives are predicted. We do not wish to inaccurately accuse a policy holder of fraud when there is none. On the other hand, a higher recall means fewer false negatives (predicting no when it is yes), which we are not concerned with for this analysis.

The extreme gradient boost model has the best balance between precision and recall. It’s the only model with a recall score above 0.80. The higher recall comes at a cost, as its precision score is slightly lower than that of the other models. We can visualize this by viewing the confusion matrix for each model. In Figure 35 we see that the extreme gradient boost model classified 912 insurance claims (lower left box) as fraudulent that were fraudulent. These are true positives. The model with the next highest number of correctly classified fraudulent claims is gradient boost with 852 (Figure 33). On the other hand, the extreme gradient boost model classified 147 claims as fraudulent (upper right box) that were not fraudulent. These are false positives. Random forest (Figure 29), gradient boost (Figure 33), AdaBoost (Figure 31), and light gradient boost (Figure 37) all have fewer false positives, at 79, 104, 72, and 67 respectively.

Interestingly, the extreme gradient boost model’s important features differ from those of the other ensemble models. While all have one significantly important feature, extreme gradient boost has no other feature with an importance over three percent (Figure 36). This differs even from gradient boost (Figure 34), which is essentially the same algorithm. This may play a part in the balance of recall and precision scores.

The remaining models have precision scores of 0.89 or higher. These models prove very good at predicting the event of interest (low false positives). A precision of 0.90 means that only 1 out of 10 policy holders predicted as fraudulent is incorrectly flagged.





Test Set



We will now take the models with a validation precision score above 0.80 and evaluate them on unseen data (the test set).
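The selection rule can be written out directly. The sketch below applies the 0.80 threshold to the class 1 precision values reported in the validation classification reports above; the dictionary and variable names are illustrative.

```python
# Validation precision for the positive class (fraud = 1), from the reports above.
val_precision = {
    "Logistic Regression": 0.64,
    "Support Vector Machine": 0.93,
    "Decision Tree": 0.27,
    "Random Forest": 0.91,
    "AdaBoost": 0.92,
    "Gradient Boosting": 0.89,
    "Extreme Gradient Boosting": 0.86,
    "Light GBM": 0.93,
}

# Keep only the models that clear the precision threshold.
selected = [name for name, p in val_precision.items() if p > 0.80]
print(selected)
```

Logistic regression and the decision tree fall below the threshold, which leaves the six models evaluated in the sections that follow.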



Support Vector Machine



## ******SVC Test Set Classification Report******
##               precision    recall  f1-score   support
## 
##            0       0.92      0.98      0.95      3053
##            1       0.93      0.79      0.86      1175
## 
##     accuracy                           0.93      4228
##    macro avg       0.93      0.88      0.90      4228
## weighted avg       0.93      0.93      0.92      4228





Random Forest



## ******Random Forest Test Set Classification Report******
##               precision    recall  f1-score   support
## 
##            0       0.91      0.97      0.94      3053
##            1       0.90      0.74      0.81      1175
## 
##     accuracy                           0.90      4228
##    macro avg       0.90      0.85      0.87      4228
## weighted avg       0.90      0.90      0.90      4228



Ada Boost





## ******AdaBoost Test Set Classification Report******
##               precision    recall  f1-score   support
## 
##            0       0.91      0.97      0.94      3053
##            1       0.90      0.74      0.81      1175
## 
##     accuracy                           0.91      4228
##    macro avg       0.90      0.86      0.88      4228
## weighted avg       0.90      0.91      0.90      4228





Gradient Boosting





## ******Gradient Boost Test Set Classification Report******
##               precision    recall  f1-score   support
## 
##            0       0.92      0.96      0.94      3053
##            1       0.87      0.78      0.82      1175
## 
##     accuracy                           0.91      4228
##    macro avg       0.89      0.87      0.88      4228
## weighted avg       0.90      0.91      0.90      4228





Extreme Gradient Boosting





## ******Extreme Gradient Boost Test Set Classification Report******
##               precision    recall  f1-score   support
## 
##            0       0.93      0.94      0.94      3053
##            1       0.85      0.83      0.84      1175
## 
##     accuracy                           0.91      4228
##    macro avg       0.89      0.89      0.89      4228
## weighted avg       0.91      0.91      0.91      4228





Light LGB





## ******Light Gradient Boost Test Set Classification Report******
##               precision    recall  f1-score   support
## 
##            0       0.92      0.97      0.94      3053
##            1       0.91      0.77      0.83      1175
## 
##     accuracy                           0.91      4228
##    macro avg       0.91      0.87      0.89      4228
## weighted avg       0.91      0.91      0.91      4228





Metric Comparison







Figure 41






The support vector machine's test precision score remained the same as its validation precision score. The ensemble models had small decreases in their test precision scores compared to their validation scores. This indicates possible slight overfitting.



Selected Best Models



We will build models using the algorithms from the previous models that had precision scores above 0.88 and evaluate them with cross validation.



Pre-processing





Split data



## *****X_cv Shape*****
## (28181, 23)



## *****y_cv Shape*****
## (28181, 1)



Split data Train/Test



## *****X_train_cv Shape*****
## (19726, 23)



## *****X_test_cv Shape*****
## (8455, 23)



## *****y_train_cv Shape*****
## (19726, 1)



## *****y_test_cv Shape*****
## (8455, 1)



## *****y_train_cv Type*****
## <class 'pandas.core.frame.DataFrame'>



## *****X_test_cv Type*****
## <class 'pandas.core.frame.DataFrame'>







## *****y_train_cv_np Shape*****
## (19726, 1)



## *****y_test_cv_np Shape*****
## (8455, 1)





## *****y_train_cv_rv Shape*****
## (19726,)



## *****y_test_cv_rv Shape*****
## (8455,)
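The shapes printed above are consistent with a 70/30 train/test split followed by flattening the target into a one-dimensional array. The sketch below reproduces those shapes on synthetic data. The `test_size=0.3` value is inferred from the shapes rather than stated in the text, and the random data and the `target` column name are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same dimensions as X_cv and y_cv.
rng = np.random.default_rng(0)
n_rows, n_features = 28181, 23
X_cv = pd.DataFrame(rng.normal(size=(n_rows, n_features)))
y_cv = pd.DataFrame({"target": rng.integers(0, 2, size=n_rows)})

# Assumed 70/30 split; random_state is arbitrary.
X_train_cv, X_test_cv, y_train_cv, y_test_cv = train_test_split(
    X_cv, y_cv, test_size=0.3, random_state=0
)

# scikit-learn estimators expect a 1-D target, so convert the
# single-column DataFrame to NumPy and flatten it.
y_train_cv_rv = y_train_cv.to_numpy().ravel()
y_test_cv_rv = y_test_cv.to_numpy().ravel()

print(X_train_cv.shape, X_test_cv.shape)
print(y_train_cv.shape, y_train_cv_rv.shape)
```

Passing the flattened arrays to `fit` avoids the warning scikit-learn raises when a column-vector target is supplied.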



Transform Categorical and Numerical features







## *****X_train_cv_tr Shape*****
## (19726, 76)







## *****X_test_cv_tr Shape*****
## (8455, 76)
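The jump from 23 to 76 columns is what one would expect when categorical features are one-hot encoded. A minimal sketch of such a transformation is shown below, assuming a scikit-learn `ColumnTransformer` with `OneHotEncoder` and `StandardScaler`. The actual transformers used in this analysis are not shown in this section, and the column names and values here are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one categorical and one numerical feature.
train = pd.DataFrame({
    "incident_severity": ["Major Damage", "Minor Damage", "Total Loss", "Minor Damage"],
    "claim_amount": [5000.0, 1200.0, 9000.0, 800.0],
})
test = pd.DataFrame({
    "incident_severity": ["Major Damage", "Trivial Damage"],
    "claim_amount": [3000.0, 400.0],
})

preprocessor = ColumnTransformer(
    transformers=[
        # Categories unseen during fit are encoded as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["incident_severity"]),
        ("num", StandardScaler(), ["claim_amount"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training set only, then reuse the fitted transformer on the test set.
X_train_tr = preprocessor.fit_transform(train)
X_test_tr = preprocessor.transform(test)

print(X_train_tr.shape, X_test_tr.shape)
```

Fitting on the training set only keeps category levels and scaling statistics from the test set out of the model, and guarantees that both sets end up with the same number of columns.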



Cross Validation





For the next models we will use cross validation on our training set.



Cross validation partitions the training set into equal subsets (folds). The folds are used to assess a model's performance on training data.

The process works by setting aside the first fold as a test set and using the remaining folds as the aggregated training set. The model is trained on the aggregated training set and its performance is then evaluated on the held-out fold. This continues until every fold has been held out as a test set. An evaluation metric is calculated for each iteration and the results are averaged, which gives the cross validated metric.

This allows us to evaluate our model on different test sets without having to expose the model to the actual test set.
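The procedure described above can be sketched as an explicit loop. This is an illustration on synthetic data, not the code used in this analysis: the logistic regression model, the dataset from `make_classification`, and the fixed random seeds are all assumptions made to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import KFold

# Synthetic stand-in for the training set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

fold_scores = []
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, holdout_idx in kfold.split(X):
    # Train on the aggregated remaining folds.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # Evaluate on the fold that was set aside.
    predictions = model.predict(X[holdout_idx])
    fold_scores.append(precision_score(y[holdout_idx], predictions))

# Average the per-fold metric to get the cross validated metric.
cv_precision = float(np.mean(fold_scores))
print("per-fold precision:", np.round(fold_scores, 3))
print("cross validated precision:", round(cv_precision, 3))
```

A fresh model is created inside the loop so that nothing learned from one held-out fold carries over into the next iteration.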



Stratified k-fold keeps approximately the same percentage of each target class in every fold. It also sets the number of folds used in our cross validation, which for this analysis is five.
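A small sketch of stratified five-fold cross validation follows, assuming scikit-learn's `StratifiedKFold` and `cross_val_score` with `scoring="precision"`. The data is synthetic with roughly 27% positives, and the small random forest is only a stand-in for the tuned models listed below.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the training data.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.73, 0.27], random_state=0
)

# Five folds, each with about the same share of positive cases.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_positive_rates = [y[holdout_idx].mean() for _, holdout_idx in skf.split(X, y)]
print("positive rate per fold:", np.round(fold_positive_rates, 3))

# The same splitter can be passed to cross_val_score via the cv argument.
model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, cv=skf, scoring="precision")
print("precision per fold:", np.round(scores, 3))
print("mean precision:", round(float(scores.mean()), 3))
```

Stratification matters for imbalanced targets such as fraud, because a plain random split could leave a fold with too few positive cases for a stable precision estimate.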



Support Vector Machine





SVC(C=1, gamma=0.1, random_state=0)



Random Forest









RandomForestClassifier(max_depth=1000, max_features=0.25, min_samples_leaf=3,
                       min_samples_split=4, n_estimators=5000, n_jobs=-1)





AdaBoost





AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=6),
                   learning_rate=1, n_estimators=10000, random_state=1)



Gradient Boost



GradientBoostingClassifier(learning_rate=0.075, max_depth=7, max_features=11,
                           min_samples_leaf=50, min_samples_split=1000,
                           n_estimators=1000, subsample=0.75, warm_start=True)



Light LGB









LGBMClassifier(colsample_bytree=0.9, learning_rate=0.01, max_depth=6,
               metric='None', min_child_samples=10, n_estimators=7500,
               random_state=314, subsample=0.7)





Cross validation Score Comparison







Figure 42






Test Data



Support Vector Machine



## ******SVC Final Test Set Classification Report******
##               precision    recall  f1-score   support
## 
##            0       0.92      0.98      0.95      6164
##            1       0.92      0.78      0.85      2291
## 
##     accuracy                           0.92      8455
##    macro avg       0.92      0.88      0.90      8455
## weighted avg       0.92      0.92      0.92      8455



Random Forest

## *******Random Forest Final Test Set Classification Report********
##               precision    recall  f1-score   support
## 
##            0       0.90      0.97      0.93      6164
##            1       0.90      0.71      0.80      2291
## 
##     accuracy                           0.90      8455
##    macro avg       0.90      0.84      0.87      8455
## weighted avg       0.90      0.90      0.90      8455





AdaBoost

## *******AdaBoost Final Test Set Classification Report********
##               precision    recall  f1-score   support
## 
##            0       0.91      0.97      0.94      6164
##            1       0.90      0.73      0.81      2291
## 
##     accuracy                           0.91      8455
##    macro avg       0.90      0.85      0.87      8455
## weighted avg       0.91      0.91      0.90      8455





Gradient Boosting







## *******Gradient Boost Final Test Set Classification Report********
##               precision    recall  f1-score   support
## 
##            0       0.92      0.97      0.94      6164
##            1       0.90      0.77      0.83      2291
## 
##     accuracy                           0.91      8455
##    macro avg       0.91      0.87      0.88      8455
## weighted avg       0.91      0.91      0.91      8455





Light Gradient Boost

## *****Light Gradient Boost Final Test Set Classification Report******
##               precision    recall  f1-score   support
## 
##            0       0.92      0.97      0.94      6164
##            1       0.90      0.77      0.83      2291
## 
##     accuracy                           0.91      8455
##    macro avg       0.91      0.87      0.89      8455
## weighted avg       0.91      0.91      0.91      8455





Test Scores







Figure 43






Feature Importance



Random Forest



Figure 44






AdaBoost





Figure 45




Gradient Boosting





Figure 46




Light Gradient Boosting







Figure 47




Final Model Metrics



There was no change between the cross validation and test precision scores for the support vector machine and AdaBoost models (Figures 42 and 43). The random forest, gradient boost, and light gradient boost models improved by one percentage point in test set precision compared to their cross validation precision (Figures 42 and 43). Because the test precision scores match or slightly exceed the cross validation scores, we see no evidence of overfitting in our final models.



Insights



Our final models proved very good at classifying fraudulent claims. In addition, all models kept the number of non-fraudulent claims incorrectly classified as fraudulent low: only about one out of ten claims predicted as fraudulent was not fraudulent. For predicting on outside unseen data, any of our final models should perform well. Support vector machines proved best at classifying fraudulent claims, albeit only by a margin of about two percentage points over the other models.

For interpretability, the ensemble models may be a better choice. They share several features among their top five most important. However, the ensemble models are split on how strongly the most important feature dominates. Random forest (Figure 44) and gradient boost (Figure 46) have the categorical feature “Severity of Incident-Major Damage” as the most important feature, with an importance twice that of the next closest feature. In comparison, the top five features of AdaBoost (Figure 45) and light gradient boost (Figure 47) are all numerical. Additionally, these models’ importances are grouped closer together, so no single feature dominates the classification. These differences may not matter in practice, as all models perform well at the task at hand.
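A comparison like the one above can be made by ranking each fitted model's `feature_importances_`. The sketch below shows the idea for a single random forest on synthetic data. The feature names, the data, and the `dominance_ratio` helper value are hypothetical and do not reproduce Figures 44 to 47.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1.
importances = pd.Series(model.feature_importances_, index=feature_names)
top_five = importances.sort_values(ascending=False).head(5)
print(top_five.round(3))

# How strongly the top feature dominates the runner-up.
dominance_ratio = top_five.iloc[0] / top_five.iloc[1]
print(f"top feature is {dominance_ratio:.1f}x the next feature")
```

A ratio near 2 would correspond to the pattern described for random forest and gradient boost, while a ratio close to 1 would match the more evenly spread importances of AdaBoost and light gradient boost.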